production for you to teach someone else what you think the key lessons are
Exploration vs. production
When you are in exploration mode, you will look at lots of patterns and your brain filters out the noise
Production mode is like putting a cone on your dog.
You are deliberately limiting the reader’s field of vision such that they see the key messages from the plot and avoid too many distractions
“A Sunday on La Grande Jatte” by Seurat
Headline vs submessages
The headline is what you see first when you look at the painting or the plot.
Submessages are what you see later, on closer inspection
Often, the most interesting patterns in the data are ones that you don’t see right away when you make the very first plot.
Iterating on plot design
“Make dozens of plots”
Quoctrung Bui, former 30538 guest lecturer and former Harris data viz instructor
Iterating on plot design
What does he mean?
The first plot you make will never be the one you should show
As you are generating graphs for yourself in exploration mode, you will produce many candidates that could end up being used in production mode
As a rule of thumb, you should try out at least three different plotting concepts (marks)
Within each concept, you will need to try out several different encodings
Summary:
Decide if you are trying to explore the data or produce a plot for someone else
For any given plot, look closely (like Seurat) beyond just the headline
Iterate
Intro to data
Introduction to data
Most of our visualization lectures are based on the University of Washington textbook, but the textbook doesn’t have enough material on exploratory data analysis. So we are supplementing with
cut
Fair 1610
Good 4906
Very Good 12082
Premium 13791
Ideal 21551
dtype: int64
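A count table like the one above can be produced with pandas' `value_counts`. A minimal sketch on a hypothetical mini-version of the data (the counts shown above come from the full 53,940-row diamonds dataset):

```python
import pandas as pd

# Hypothetical mini-version of the diamonds data, for illustration only;
# the counts shown above come from the full 53,940-row dataset.
diamonds = pd.DataFrame(
    {'cut': ['Ideal', 'Premium', 'Ideal', 'Fair', 'Good', 'Ideal', 'Premium']})

# Count rows per category, least common first (matching the output above)
diamonds_cut = diamonds['cut'].value_counts(ascending=True)
print(diamonds_cut)
```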
Summarizing with a bar graph
diamonds_cut = diamonds_cut.reset_index().rename(columns={0: 'N'})
alt.Chart(diamonds_cut).mark_bar().encode(
    alt.X('cut:O', title="Cut"),
    alt.Y('N:Q', title="Count"))
Categorical variables: summary
This section is very brief because there’s basically only one good way to plot categorical variables with a small number of categories and this is it.
You can use mark_point() instead of mark_bar(), but overall, there’s a clear right answer about how to do this.
We include this material mainly to foreshadow the fact that we will do a lot on categorical variables in the next lecture when we get to “Exploring Co-variation”
Continuous Variables
Continuous variables: roadmap
Binning + histograms using movies
Histograms and density plots using penguins
Exploring carat size using diamonds
Remark: These skills are absolutely fundamental, so we will intentionally be a bit repetitive.
Rotten Tomatoes ratings are determined by taking “thumbs up” and “thumbs down” judgments from film critics and calculating the percentage of positive reviews.
This is a continuous measure, but we can bin it to create a histogram of frequencies
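Binning is what the histogram does under the hood: chop the 0–100 scale into intervals and count observations in each. A minimal sketch with pandas' `pd.cut` on made-up ratings (the real plot uses the movies dataset):

```python
import pandas as pd

# Made-up Rotten Tomatoes-style ratings, for illustration only
ratings = pd.Series([12, 35, 47, 51, 64, 66, 78, 81, 90, 97])

# Chop the 0-100 scale into five equal-width bins and count each one
binned = pd.cut(ratings, bins=range(0, 101, 20))
counts = binned.value_counts().sort_index()
print(counts)
```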
Histogram using mark_bar()
hist_rt = alt.Chart(movies_url).mark_bar().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20),
          title="Rotten Tomatoes Rating (%)"),
    alt.Y('count():Q', title="Count"))
hist_rt
Discussion question: what are the headline and sub-messages?
Histogram of IMDB ratings
IMDB ratings are formed by averaging scores (ranging from 1 to 10) provided by the site’s users.
hist_imdb = alt.Chart(movies_url).mark_bar().encode(
    alt.X('IMDB_Rating:Q', bin=alt.BinParams(maxbins=20),
          title="IMDB Ratings"),
    alt.Y('count():Q', title="Count"))
hist_imdb
Side-by-side
hist_rt | hist_imdb
Discussion question: compare the two ratings distributions. If your goal is for the headline of the graph to be about differentiating good movies from bad ones, which is more informative?
We previously picked the maximum number of equally-spaced bins (BinParams(maxbins=20)) and let Altair choose “nice”-looking bin widths for the histogram
Alternatively, we can manually control the bin width using step
Histogram with steps of 200
alt.Chart(penguins).mark_bar().encode(
    alt.X('body_mass_g:Q', bin=alt.BinParams(step=200), title="Body Mass (g)"),
    alt.Y('count():Q', title="Count"))
Histogram step parameter
step=20 vs. step=200 vs. step=2000
Discussion question: what message(s) come from each binwidth choice? Which do you prefer?
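The trade-off behind the step choice can be seen without plotting anything: wide bins smooth detail away, narrow bins expose it (along with noise). A sketch using numpy's `np.histogram` on made-up, roughly penguin-like body masses:

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up bimodal body masses in grams, for illustration only
masses = np.concatenate([rng.normal(3700, 300, 500),
                         rng.normal(5000, 400, 300)])

for step in (20, 200, 2000):
    edges = np.arange(2000, 7000 + step, step)
    counts, _ = np.histogram(masses, bins=edges)
    # step=2000 collapses the two modes into a couple of tall bars;
    # step=20 spreads the same data over many sparsely filled bins
    print(f"step={step}: {len(counts)} bins, tallest bar = {counts.max()}")
```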
Density plot
An alternative to the histogram for exploring frequencies in a continuous variable: a density plot using transform_density
alt.Chart(penguins).transform_density(
    'body_mass_g', as_=['body_mass_g', 'density']).mark_area().encode(
    alt.X('body_mass_g:Q', title="Body Mass (g)"),
    alt.Y('density:Q', title="Density"))
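Under the hood, transform_density fits a kernel density estimate: it averages a small bell curve centered on every observation. A minimal numpy-only sketch of the same idea (the data, grid, and bandwidth are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
masses = rng.normal(4200, 800, 300)  # made-up body masses, for illustration

def kde(x, data, bandwidth):
    """Gaussian kernel density estimate evaluated at each point of x."""
    z = (x[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

grid = np.linspace(1000, 8000, 400)
density = kde(grid, masses, bandwidth=300)

# A density is non-negative and integrates to (approximately) one
area = float(np.sum(density[:-1] * np.diff(grid)))
print(round(area, 2))
```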
Back to diamonds, focus on carat
alt.data_transformers.disable_max_rows()
alt.Chart(diamonds).mark_bar().encode(
    alt.X('carat', bin=alt.Bin(maxbins=10), title="Carat"),
    alt.Y('count()', title="Count"))
The first line disables Altair’s maximum row limit (5,000)
Histogram of carat
diamonds_small = diamonds.loc[diamonds['carat'] < 2.1]
alt.Chart(diamonds_small).mark_bar().encode(
    alt.X('carat', bin=alt.BinParams(step=0.2), title="Carat"),
    alt.Y('count()', title="Count"))
In-class exercise: histogram of carat
alt.Chart(diamonds_small).mark_bar().encode(
    alt.X('carat', bin=alt.BinParams(step=0.02), title="Carat"),
    alt.Y('count()', title="Count"))
Discussion questions
What is the headline of this plot? Submessages?
What questions does it raise?
Typical continuous variables: summary
Main tool to explore uni-dimensional continuous variables: histograms
Varying the bin widths can reveal different patterns
Continuous variables: unusual values
Unusual continuous variables: roadmap
Case study: the y dimension in diamonds
Explore some unusual values
Three options for handling unusual values
diamonds: identify unusual y values
First pass to examine for unusual values: summary statistics
diamonds['y'].describe()
count 53940.000000
mean 5.734526
std 1.142135
min 0.000000
25% 4.720000
50% 5.710000
75% 6.540000
max 58.900000
Name: y, dtype: float64
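Those summary statistics already flag trouble: a minimum width of 0 mm is impossible, and a maximum of 58.9 mm would be an enormous diamond. A sketch of pulling out the suspicious rows with a boolean filter (hypothetical mini-frame; the thresholds are judgment calls):

```python
import pandas as pd

# Hypothetical mini-version of the diamonds data, for illustration only
diamonds = pd.DataFrame({'y': [4.72, 5.71, 0.0, 6.54, 58.9, 5.9]})

# y = 0 mm is physically impossible; y > 20 mm is implausibly large
unusual = diamonds.loc[(diamonds['y'] == 0) | (diamonds['y'] > 20), 'y']
print(unusual)
```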
The paper is trying to quantify how much earnings change from month to month for the typical US worker.
Consider the following fake data (next slide)
Toy winsorization example
Suppose we have observations of earnings changes. 99% of the data follows a normal distribution with std. dev. 0.2, and 1% of the data consists of extremely large changes.
last month ($)   this month ($)   % change   |% change|
 600               600                0%          0%
 600               570               -5%          5%
 600               540              -10%         10%
 600               630                5%          5%
 …                 (99% of sample)    …           …
 600               300              -50%         50%
6000               300              -95%         95%
 300               600              100%        100%
 300              6000             1900%       1900%
What is the standard deviation of the % change in earnings?
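The question can be answered by simulation: with 99% of changes drawn from a normal distribution with std. dev. 0.2 and 1% huge, a handful of extreme observations dominate the standard deviation, and winsorizing (clipping at, say, the 1st and 99th percentiles) mostly removes their influence. A sketch (the outlier size and percentile cutoffs are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# 99% of changes: normal with std. dev. 0.2, as in the toy example above
changes = rng.normal(0, 0.2, n)
# 1% of changes: huge jumps (19.0 = +1900%, like the last row of the table)
outliers = rng.choice(n, n // 100, replace=False)
changes[outliers] = 19.0

raw_sd = changes.std()

# Winsorize: clip everything outside the 1st-99th percentile range
lo, hi = np.percentile(changes, [1, 99])
wins_sd = np.clip(changes, lo, hi).std()

print(round(raw_sd, 2), round(wins_sd, 2))
```

The raw standard deviation is driven almost entirely by the 1% of extreme changes; the winsorized one is close to the 0.2 that describes the typical worker.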